Action for Happiness is a global movement aimed at increasing well-being and contentment in people everywhere. They offer tools, resources, and courses based on the latest scientific research to help individuals live happier lives.
I, Patrick Chaccour, a hired data consultant, was handed a dataset of country rankings based on population happiness. The rankings take into account a variety of characteristics, including GDP per capita, healthy life expectancy, generosity, and freedom. With this dataset I will conduct an exploratory data analysis (EDA) to help Action for Happiness reach its objectives.
The intended audience includes the organization's stakeholders, such as policymakers, researchers, and members of the general public interested in promoting happiness on a worldwide scale.
The goal of the EDA is to provide information about the relationship between these parameters and the overall happiness of populations in various countries. Specifically, by examining the data, we will identify the countries that need particular interventions. For instance, if the study reveals that a given region has low levels of happiness, Action for Happiness can create specialized programs or efforts to meet those needs. This EDA will give Action for Happiness evidence and insights and will enable them to make data-driven decisions. They can also use the pipeline to look into trends, test theories, and assess the efficacy of various approaches.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
df_originial = pd.read_csv('2021.csv')
df_originial.head()
| | Country name | Regional indicator | Ladder score | Standard error of ladder score | upperwhisker | lowerwhisker | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Ladder score in Dystopia | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | Western Europe | 7.842 | 0.032 | 7.904 | 7.780 | 10.775 | 0.954 | 72.0 | 0.949 | -0.098 | 0.186 | 2.43 | 1.446 | 1.106 | 0.741 | 0.691 | 0.124 | 0.481 | 3.253 |
| 1 | Denmark | Western Europe | 7.620 | 0.035 | 7.687 | 7.552 | 10.933 | 0.954 | 72.7 | 0.946 | 0.030 | 0.179 | 2.43 | 1.502 | 1.108 | 0.763 | 0.686 | 0.208 | 0.485 | 2.868 |
| 2 | Switzerland | Western Europe | 7.571 | 0.036 | 7.643 | 7.500 | 11.117 | 0.942 | 74.4 | 0.919 | 0.025 | 0.292 | 2.43 | 1.566 | 1.079 | 0.816 | 0.653 | 0.204 | 0.413 | 2.839 |
| 3 | Iceland | Western Europe | 7.554 | 0.059 | 7.670 | 7.438 | 10.878 | 0.983 | 73.0 | 0.955 | 0.160 | 0.673 | 2.43 | 1.482 | 1.172 | 0.772 | 0.698 | 0.293 | 0.170 | 2.967 |
| 4 | Netherlands | Western Europe | 7.464 | 0.027 | 7.518 | 7.410 | 10.932 | 0.942 | 72.4 | 0.913 | 0.175 | 0.338 | 2.43 | 1.501 | 1.079 | 0.753 | 0.647 | 0.302 | 0.384 | 2.798 |
Columns Explained:
Cleaning the Dataset:
The dataset is ranked from the happiest countries to the least happy ones. Our objective is to help the least happy, so we must flip the dataset.
Drop all unnecessary columns
Unlog the logged GDP per capita so the values are easier to work with.
Create Sub-datasets
#Flipping the Dataset
df = df_originial[::-1].reset_index(drop=True)
#Dropping all unnecessary columns
columns_to_drop = ['Generosity', 'Ladder score in Dystopia', 'Explained by: Log GDP per capita',
'Explained by: Social support', 'Explained by: Healthy life expectancy',
'Explained by: Freedom to make life choices', 'Explained by: Perceptions of corruption',
'Dystopia + residual']
df = df.drop(columns_to_drop, axis=1)
# Used ChatGPT to anti-log the GDP per capita (np.exp assumes the column holds natural logs)
df['Logged GDP per capita'] = np.exp(df['Logged GDP per capita'])
#Renaming Columns
df = df.rename(columns= {'Standard error of ladder score': 'Standard margin of Error',
'Ladder score': 'Happiness Score',
'upperwhisker': 'Highest Score',
'lowerwhisker': 'Lowest Score',
'Logged GDP per capita': 'GPD per capita',
'Freedom to make life choices': 'Freedom',
'Explained by: Generosity': 'Perception of Generosity'})
df.head()
| | Country name | Regional indicator | Happiness Score | Standard margin of Error | Highest Score | Lowest Score | GPD per capita | Social support | Healthy life expectancy | Freedom | Perceptions of corruption | Perception of Generosity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 2.523 | 0.038 | 2.596 | 2.449 | 2197.333810 | 0.463 | 52.493 | 0.382 | 0.924 | 0.122 |
| 1 | Zimbabwe | Sub-Saharan Africa | 3.145 | 0.058 | 3.259 | 3.030 | 2815.795236 | 0.750 | 56.201 | 0.677 | 0.821 | 0.157 |
| 2 | Rwanda | Sub-Saharan Africa | 3.415 | 0.068 | 3.548 | 3.282 | 2155.978587 | 0.552 | 61.400 | 0.897 | 0.167 | 0.227 |
| 3 | Botswana | Sub-Saharan Africa | 3.467 | 0.074 | 3.611 | 3.322 | 17712.041536 | 0.784 | 59.269 | 0.824 | 0.801 | 0.027 |
| 4 | Lesotho | Sub-Saharan Africa | 3.512 | 0.120 | 3.748 | 3.276 | 2768.331303 | 0.787 | 48.700 | 0.715 | 0.915 | 0.103 |
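As a quick sanity check on the anti-log step, the sketch below applies `np.exp` to the logged values shown for Finland and Denmark in the first table and confirms that `np.log` recovers them. It assumes the column holds natural logarithms, which is what `np.exp` inverts; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Logged GDP per capita for Finland and Denmark, copied from the first table
logged = np.array([10.775, 10.933])

# np.exp undoes a natural log (assumption: the column is ln(GDP per capita))
gdp = np.exp(logged)

# Round trip: taking the natural log again should give back the inputs
recovered = np.log(gdp)

print(gdp.round(0))
print(np.allclose(recovered, logged))
```

If the source column had been a base-10 log instead, `10 ** logged` would be the matching inverse, so it is worth confirming the convention in the dataset's documentation.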
# Asked ChatGPT to create a subset of the first 10 and last 10 rows of the flipped dataset
first_10_rows = df.head(10)
last_10_rows = df.tail(10)
subset = pd.concat([first_10_rows, last_10_rows])
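The `head`/`tail` subset above relies on the rows already being sorted by happiness. A minimal alternative sketch, which picks the extremes by score rather than by position, is shown below. The `sample` frame is a small illustration built from rows in the tables above, and `n = 2` stands in for the 10 used in the notebook.

```python
import pandas as pd

# Small sample built from rows shown in the tables above (not the full dataset)
sample = pd.DataFrame({
    'Country name': ['Finland', 'Denmark', 'Switzerland',
                     'Afghanistan', 'Zimbabwe', 'Rwanda'],
    'Happiness Score': [7.842, 7.620, 7.571, 2.523, 3.145, 3.415],
})

n = 2  # the notebook uses 10

# nsmallest / nlargest select by value, so the row order of the input does not matter
saddest = sample.nsmallest(n, 'Happiness Score')
happiest = sample.nlargest(n, 'Happiness Score')
subset_by_score = pd.concat([saddest, happiest])

print(subset_by_score)
```

This gives the same kind of subset even if a future file arrives in a different order.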
We must examine the data for missing values, outliers, and discrepancies. It is also critical to grasp the data types and variable distributions.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Country name               149 non-null    object
 1   Regional indicator         149 non-null    object
 2   Happiness Score            149 non-null    float64
 3   Standard margin of Error   149 non-null    float64
 4   Highest Score              149 non-null    float64
 5   Lowest Score               149 non-null    float64
 6   GPD per capita             149 non-null    float64
 7   Social support             149 non-null    float64
 8   Healthy life expectancy    149 non-null    float64
 9   Freedom                    149 non-null    float64
 10  Perceptions of corruption  149 non-null    float64
 11  Perception of Generosity   149 non-null    float64
dtypes: float64(10), object(2)
memory usage: 14.1+ KB
df.isnull().sum()
Country name                 0
Regional indicator           0
Happiness Score              0
Standard margin of Error     0
Highest Score                0
Lowest Score                 0
GPD per capita               0
Social support               0
Healthy life expectancy      0
Freedom                      0
Perceptions of corruption    0
Perception of Generosity     0
dtype: int64
# To obtain summary statistics of the dataset, I used this code to pick the numeric columns from the DataFrame,
# build a new DataFrame that contains only those columns, and call describe() on it for the summary.
numeric_columns = df.select_dtypes(include=[np.number])
numeric_columns.describe()
| | Happiness Score | Standard margin of Error | Highest Score | Lowest Score | GPD per capita | Social support | Healthy life expectancy | Freedom | Perceptions of corruption | Perception of Generosity |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 149.000000 | 149.000000 | 149.000000 | 149.000000 | 149.000000 | 149.000000 | 149.000000 | 149.000000 | 149.000000 | 149.000000 |
| mean | 5.532839 | 0.058752 | 5.648007 | 5.417631 | 21560.608440 | 0.814745 | 64.992799 | 0.791597 | 0.727450 | 0.178047 |
| std | 1.073924 | 0.022001 | 1.054330 | 1.094879 | 20908.784656 | 0.114889 | 6.762043 | 0.113332 | 0.179226 | 0.098270 |
| min | 2.523000 | 0.026000 | 2.596000 | 2.449000 | 761.279066 | 0.463000 | 48.478000 | 0.382000 | 0.082000 | 0.000000 |
| 25% | 4.852000 | 0.043000 | 4.991000 | 4.706000 | 5120.462265 | 0.750000 | 59.802000 | 0.718000 | 0.667000 | 0.105000 |
| 50% | 5.534000 | 0.054000 | 5.625000 | 5.413000 | 14314.095070 | 0.832000 | 66.603000 | 0.804000 | 0.781000 | 0.164000 |
| 75% | 6.255000 | 0.070000 | 6.344000 | 6.128000 | 33556.974347 | 0.905000 | 69.600000 | 0.877000 | 0.845000 | 0.239000 |
| max | 7.842000 | 0.173000 | 7.904000 | 7.780000 | 114347.804564 | 0.983000 | 76.953000 | 0.970000 | 0.939000 | 0.541000 |
numeric_columns.hist(bins=20, figsize=(10, 6))
#Asked ChatGpt to Make the histograms fit properly, without overlapping
plt.tight_layout()
plt.show()
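The checks above cover missing values and distributions, but not outliers. One common way to flag them is the 1.5 × IQR rule, sketched below on the five GDP values shown in the cleaned table. The helper `iqr_outliers` and its parameter `k` are hypothetical names introduced for illustration, and the rule is only one of several reasonable choices.

```python
import pandas as pd

# Five rows copied from the cleaned table above (not the full dataset)
sample = pd.DataFrame({
    'Country name': ['Afghanistan', 'Zimbabwe', 'Rwanda', 'Botswana', 'Lesotho'],
    'GPD per capita': [2197.333810, 2815.795236, 2155.978587,
                       17712.041536, 2768.331303],
})

def iqr_outliers(frame, column, k=1.5):
    """Return rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = (frame[column] < q1 - k * iqr) | (frame[column] > q3 + k * iqr)
    return frame[mask]

outliers = iqr_outliers(sample, 'GPD per capita')
print(outliers)
```

On this small sample only Botswana is flagged, because its GDP per capita is far above that of the other four countries. In the notebook the same helper could be called on `df` for each numeric column.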
To extract insights from data, we can utilize visualization techniques such as scatter plots, bar charts, box plots, heatmaps, and geographical maps. Each insight should be accompanied by appropriate visualizations and a compelling story that illustrates why the insight is important to the business.
# Isolated the first 50 rows (the 50 least happy countries after the flip)
df_top50 = df[:50]
# Plotted a pie chart to show which regions contain most of the 50 saddest countries
fig, ax = plt.subplots(figsize=(6, 6))
df_top50['Regional indicator'].value_counts().plot(kind='pie', ax=ax)
plt.title('Distribution of Top 50 Saddest Countries by Region')
plt.show()
This pie chart indicates that the majority of the sadder countries lie in Sub-Saharan Africa, the Middle East and North Africa, and South Asia, which means Action for Happiness should focus its projects on these regions.
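The shares behind a pie chart like this can also be printed as numbers, which makes the claim easier to verify. The sketch below uses only the regions of the five lowest-ranked countries shown in the cleaned table, so the percentages are illustrative and do not reproduce the chart for all 50 countries.

```python
import pandas as pd

# Regions of the five lowest-ranked countries shown in the cleaned table above
regions = pd.Series(
    ['South Asia', 'Sub-Saharan Africa', 'Sub-Saharan Africa',
     'Sub-Saharan Africa', 'Sub-Saharan Africa'],
    name='Regional indicator',
)

# normalize=True returns each region's share instead of a raw count
shares = regions.value_counts(normalize=True)

print((shares * 100).round(1))
```

In the notebook, the same call on `df_top50['Regional indicator']` would give the exact percentage for each slice.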
However, to help such devastated regions, we must find out what is causing this. We must compare sad and happy countries to understand what is going on so that we can address it.
#Used Chatgpt to add a trendline to the scatter plot
coefficients = np.polyfit(df['Freedom'], df['Happiness Score'], 1)
m = coefficients[0]
c = coefficients[1]
#Customized the Scatter Plot
plt.scatter(df['Freedom'],df['Happiness Score'], color = 'blue')
plt.plot(df['Freedom'], m*df['Freedom'] + c, color='red', label='Trendline')
plt.xlabel('Percentage of Freedom')
plt.ylabel('Happiness Score')
plt.title('Does Freedom affect happiness?')
plt.legend()
plt.show()
The data reveals a positive relationship between freedom (the dataset's "Freedom to make life choices" measure) and overall happiness, as shown by the upward trend in the scatter plot. It is worth mentioning, however, that some countries report high levels of freedom while their happiness scores remain relatively low. This implies that factors other than freedom also influence a country's overall well-being.
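The strength of this relationship can be summarized with a Pearson correlation coefficient. The sketch below computes it with `np.corrcoef` on the ten countries shown in the tables above. Because that sample contains only the extremes of the ranking, the value it prints is illustrative; running the same two lines on the full `df` would give the figure for all 149 countries.

```python
import numpy as np

# Freedom and happiness for the ten countries shown in the tables above
# (five happiest, then five saddest); not the full dataset
freedom = np.array([0.949, 0.946, 0.919, 0.955, 0.913,
                    0.382, 0.677, 0.897, 0.824, 0.715])
happiness = np.array([7.842, 7.620, 7.571, 7.554, 7.464,
                      2.523, 3.145, 3.415, 3.467, 3.512])

# Off-diagonal entry of the 2x2 correlation matrix is Pearson's r
r = np.corrcoef(freedom, happiness)[0, 1]

# r squared: share of variance in happiness that a straight-line fit accounts for
r_squared = r ** 2

print(f'r = {r:.2f}, r^2 = {r_squared:.2f}')
```

Rwanda (freedom 0.897, happiness 3.415) is an example of the high-freedom, low-happiness cases mentioned above.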
#Isolated rows from datasets to get the 10 saddest and 10 happiest countries
table1 = subset['Country name'].head(10)
table2 = df_originial['Country name'].head(10)
#Customization
table1_title = 'Top 10 Sadder countries'
table2_title = 'Top 10 Happier countries'
#Displayed
print(table1_title)
print(table1)
print('\n' + table2_title)
print(table2)
Top 10 Sadder countries
0    Afghanistan
1       Zimbabwe
2         Rwanda
3       Botswana
4        Lesotho
5         Malawi
6          Haiti
7       Tanzania
8          Yemen
9        Burundi
Name: Country name, dtype: object

Top 10 Happier countries
0        Finland
1        Denmark
2    Switzerland
3        Iceland
4    Netherlands
5         Norway
6         Sweden
7     Luxembourg
8    New Zealand
9        Austria
Name: Country name, dtype: object
Above is a list of the 10 saddest and 10 happiest countries in 2021 according to the dataset.
I decided to put both ends of the rankings into one bar chart for easier comparison. We are comparing healthy life expectancy in countries such as Afghanistan and Lesotho with that in countries such as Switzerland and Iceland.
# Asked Chatgpt to help customize the colors on the barplot
custom_palette = ["red" if y < 65 else "green" for y in subset['Healthy life expectancy']]
ax = sns.barplot(x='Country name', y='Healthy life expectancy', data=subset, palette=custom_palette)
# Asked Chatgpt to rotate the countries on the x label so they would fit
plt.xticks(rotation=45)
plt.xlabel('Country')
plt.ylabel('Healthy life expectancy (years)')
plt.title('Healthy life expectancy for top 10 Saddest & Happiest countries', fontsize=14)
# Fit the layout after all labels are set, then display
plt.tight_layout()
plt.show()
The comparison of healthy life expectancy in the happiest and saddest countries indicates a significant gap. According to the graph, the happier countries have healthy life expectancies above 70 years, whilst the sadder countries have much lower ones, averaging around 55 years. This disparity suggests that the sadder countries may face difficult conditions such as limited access to key resources like food and water, as well as to proper medical treatment. This observation points to possible poverty-related problems in these countries.
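The gap read off the chart can be checked numerically with a grouped mean. The sketch below uses the five saddest and five happiest countries shown in the tables above, so it covers only half of the 20 countries in `subset`. The `Group` column is a label added for illustration.

```python
import pandas as pd

# Healthy life expectancy for the ten countries shown in the tables above
sample = pd.DataFrame({
    'Country name': ['Afghanistan', 'Zimbabwe', 'Rwanda', 'Botswana', 'Lesotho',
                     'Finland', 'Denmark', 'Switzerland', 'Iceland', 'Netherlands'],
    'Group': ['Saddest'] * 5 + ['Happiest'] * 5,
    'Healthy life expectancy': [52.493, 56.201, 61.400, 59.269, 48.700,
                                72.0, 72.7, 74.4, 73.0, 72.4],
})

# Mean healthy life expectancy per group, and the difference between them
group_means = sample.groupby('Group')['Healthy life expectancy'].mean()
gap = group_means['Happiest'] - group_means['Saddest']

print(group_means.round(1))
print(f'Gap: {gap:.1f} years')
```

For these ten countries the means are about 55.6 and 72.9 years, a gap of roughly 17 years, which is consistent with the chart.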
Hence, we must look into the economic situation of such regions.
# Created a list of every Region on the Data Set
Regions = df['Regional indicator'].unique().tolist()
Regions
['South Asia', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Middle East and North Africa', 'Southeast Asia', 'Commonwealth of Independent States', 'Central and Eastern Europe', 'East Asia', 'Western Europe', 'North America and ANZ']
#Used ChatGPT to help set all the boxplots in a grid
num_regions = len(Regions)
num_cols = 2
num_rows = (num_regions - 1) // num_cols + 1
# Set the characteristics of the subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, num_rows * 4))
# Used ChatGPT to create this loop to plot boxplots for each region
for i, region in enumerate(Regions):
    df_iso = df.query('`Regional indicator` == @region')
    row = i // num_cols
    col = i % num_cols
    ax = axes[row][col]
    sns.boxplot(x='GPD per capita', data=df_iso, ax=ax)
    ax.set_title(f'Boxplot for GPD per capita - {region}')
# Display
plt.tight_layout()
plt.show()
Further examination of the box plots focused on Sub-Saharan Africa, which the pie chart identified as the saddest region, and revealed exceptionally low GDP per capita values, with a median of about 3,000 dollars and the majority lying between 2,000 and 5,000 dollars. In better-off regions such as Western Europe, on the other hand, GDP per capita centered on roughly 52,000 dollars, with the bulk falling between 40,000 and 58,000 dollars.
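The figures quoted from the box plots can be backed by a per-region summary table. The sketch below computes the median, minimum, and maximum GDP per capita for the two regions discussed, using only the countries shown in the tables above. The Western Europe values are obtained with `np.exp` from the logged column, as in the cleaning step, and the small sample means the numbers will differ from those for the full regions.

```python
import numpy as np
import pandas as pd

# Countries shown in the tables above; Western Europe GDP is un-logged with np.exp
sample = pd.DataFrame({
    'Country name': ['Zimbabwe', 'Rwanda', 'Botswana', 'Lesotho',
                     'Finland', 'Denmark', 'Switzerland', 'Iceland', 'Netherlands'],
    'Regional indicator': ['Sub-Saharan Africa'] * 4 + ['Western Europe'] * 5,
    'GPD per capita': [2815.795236, 2155.978587, 17712.041536, 2768.331303,
                       *np.exp([10.775, 10.933, 11.117, 10.878, 10.932])],
})

# One row per region with the statistics a box plot displays visually
region_stats = (
    sample.groupby('Regional indicator')['GPD per capita']
          .agg(['median', 'min', 'max', 'count'])
)

print(region_stats.round(0))
```

The median is used rather than the mean because a single high value, such as Botswana here, pulls the mean upward.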
In conclusion, the findings strongly suggest that poverty is a key contributor to a country's overall unhappiness. The observed differences in economic measures, such as GDP per capita, between unhappy and happy regions underline the essential role that socioeconomic factors play in shaping well-being outcomes.
#Asked Chatgpt to help colorcode and display the figure
fig = go.Figure(data=go.Choropleth(locations=df['Country name'],
z=df['GPD per capita'],
locationmode='country names',
colorscale='YlOrRd_r',
colorbar_title='GPD per capita',))
fig.update_layout(title='GPD per capita by Country',
geo=dict(showframe=False, showcoastlines=False,),)
fig.show()
The choropleth map shows the global distribution of GDP per capita by country. The color scale, which ranges from dark red to bright yellow, depicts the economic spectrum, with darker shades signifying severely low GDP values (below 5,000 dollars) and brighter colors denoting rich economies (over 100,000 dollars).
A close analysis of the map reveals that nations in Africa and South Asia have darker hues, signifying serious economic issues and notably low GDP per capita. Countries in Europe, on the other hand, show tones of orange and yellow, signifying substantially higher economic well-being, with GDP per capita above 50,000 dollars.
By showing these discrepancies, the map improves our understanding of the global distribution of economic prosperity and the striking contrast between places with different GDP per capita levels.
However, we also need to know whether such low or high levels of GDP are caused by governmental corruption.
# Average perceptions of corruption per country
avg_corruption = df.groupby('Country name')['Perceptions of corruption'].mean().reset_index()
# Creating the bar plot
fig = go.Figure(data=go.Bar(x=avg_corruption['Country name'], y=avg_corruption['Perceptions of corruption']))
# Customization
fig.update_layout(title='Level of Corruption Worldwide', xaxis_title='Country', yaxis_title='Perceptions of corruption')
# Display
fig.show()
The plot provides information about the prevalence of corruption in various countries. Each bar represents a country, and the height of the bar symbolizes the country's average level of perceived corruption. Higher bars imply a higher perception of corruption, whereas lower bars indicate a lesser perception.
We can identify countries with substantially higher or lower degrees of corruption by inspecting the plot. This image helps us comprehend the global distribution and differences in corruption perceptions, offering useful information for future analysis and research on this global topic.
As we can see, corruption is perceived everywhere, and it is difficult to identify whether it plays the main role in a country's misery.
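A bar chart with one bar per country makes it hard to judge how corruption relates to the other variables. A correlation table is one way to look at that directly. The sketch below builds it with `DataFrame.corr` for the ten countries shown in the tables above. It is a small, extreme sample, so the coefficients are illustrative, and a correlation does not show that corruption causes low GDP or low happiness.

```python
import numpy as np
import pandas as pd

# Ten countries shown in the tables above (five saddest, then five happiest)
sample = pd.DataFrame({
    'Happiness Score': [2.523, 3.145, 3.415, 3.467, 3.512,
                        7.842, 7.620, 7.571, 7.554, 7.464],
    'GPD per capita': [2197.333810, 2815.795236, 2155.978587,
                       17712.041536, 2768.331303,
                       *np.exp([10.775, 10.933, 11.117, 10.878, 10.932])],
    'Perceptions of corruption': [0.924, 0.821, 0.167, 0.801, 0.915,
                                  0.186, 0.179, 0.292, 0.673, 0.338],
})

# Pairwise Pearson correlations between the three columns
corr = sample.corr()

print(corr.round(2))
```

In this sample, perceived corruption is negatively correlated with the happiness score. Rwanda (low perceived corruption, low happiness) and Iceland (high perceived corruption, high happiness) show that the pattern has exceptions.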
sns.barplot(x=subset['Country name'], y=subset['Social support'])
# Rotate x labels so they can fit
plt.xticks(rotation=45)
# Customization
plt.xlabel('Country')
plt.ylabel('Social support')
plt.title('Social support by Country')
# Display
plt.tight_layout()
plt.show()
The data reveals a significant discrepancy in social support between the ten happiest and ten saddest countries. Compared with the happy countries shown earlier, where social support is around 0.95, the sad countries have much lower levels, in some cases about half (Afghanistan scores 0.463). This disparity suggests that the sadder countries' lack of adequate support structures may contribute to their overall unhappiness.
As a data consultant for Action for Happiness, I conducted an exploratory data analysis to discover insights about population well-being and the factors that contribute to it. Several major findings resulted from this analysis:
However, the dataset is based on self-reported data. Furthermore, the analysis identifies correlations rather than causation. Further study and data collection would help in gaining a more comprehensive understanding of the factors that influence happiness.
Furthermore, the identification of specific regions with lower happiness scores, such as Sub-Saharan Africa and South Asia, emphasizes the importance of focused interventions in these regions. Investing in poverty reduction, education, and healthcare programs could boost people's well-being and happiness in these areas.
Healthcare and Well-Being: Advocate for better medical systems and facilities. This could include collaborations with healthcare groups, policy advocacy, and public awareness campaigns.
Intervention: Concentrate on designing and implementing tailored intervention programs in regions with lower happiness levels, taking into account each region's unique socioeconomic and cultural environment.
Constant Data Monitoring: Running the EDA pipeline every year would help Action for Happiness track progress over time. They can gather and analyze data on a regular basis to evaluate the success of their initiatives and track changes in happiness levels. This monitoring enables them to adjust their plans based on up-to-date evidence and ensure they are having a beneficial influence on the well-being of individuals.
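To make the yearly rerun practical, the cleaning steps used in this notebook could be wrapped in a single function. The sketch below is one possible shape for it. The names `clean_happiness_data`, `COLUMNS_TO_DROP`, and `RENAMES` are hypothetical, and `errors='ignore'` is an added assumption so that a file missing one of the dropped columns does not stop the run. Column names in other years' files may differ and would need checking.

```python
import numpy as np
import pandas as pd

# Same columns the notebook drops and renames (names kept exactly as used above)
COLUMNS_TO_DROP = [
    'Generosity', 'Ladder score in Dystopia', 'Explained by: Log GDP per capita',
    'Explained by: Social support', 'Explained by: Healthy life expectancy',
    'Explained by: Freedom to make life choices',
    'Explained by: Perceptions of corruption', 'Dystopia + residual',
]
RENAMES = {
    'Standard error of ladder score': 'Standard margin of Error',
    'Ladder score': 'Happiness Score',
    'upperwhisker': 'Highest Score',
    'lowerwhisker': 'Lowest Score',
    'Logged GDP per capita': 'GPD per capita',
    'Freedom to make life choices': 'Freedom',
    'Explained by: Generosity': 'Perception of Generosity',
}

def clean_happiness_data(raw):
    """Apply the notebook's cleaning steps to one year's raw table."""
    cleaned = raw[::-1].reset_index(drop=True)                      # saddest first
    cleaned = cleaned.drop(columns=COLUMNS_TO_DROP, errors='ignore')
    cleaned['Logged GDP per capita'] = np.exp(cleaned['Logged GDP per capita'])
    return cleaned.rename(columns=RENAMES)

# Tiny stand-in for a raw file, using two rows from the first table
raw = pd.DataFrame({
    'Country name': ['Finland', 'Denmark'],
    'Ladder score': [7.842, 7.620],
    'Logged GDP per capita': [10.775, 10.933],
    'Generosity': [-0.098, 0.030],
})
cleaned = clean_happiness_data(raw)
print(cleaned)
```

Each year's file could then be loaded with `pd.read_csv` and passed through the same function before the plots are regenerated.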
By implementing these recommendations, Action for Happiness can help create a happier world by addressing the major variables that influence happiness and improving the well-being of individuals and communities.
from IPython.display import Image
Image(filename='DATA VISUALIZATION.jpg')